Curse of dimensionality and effect of sample size

Pelin Yurdadön

October 2021

Libraries & Functions

Parameters

Dimension vs Coverage

Estimating pi

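In 2D, the points are separated as shown below.

The fraction of points that fall inside the circle approximates the ratio of the circle's area to the square's area, which is pi/4. We therefore need to multiply the fraction (found previously for 2D) by 4 to estimate pi.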
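The 2D estimate can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the code of the original analysis; the function name `estimate_pi_2d`, the sampling square [-1, 1] x [-1, 1], and the default seed are assumptions made for the example.

```python
import random


def estimate_pi_2d(n, seed=0):
    """Estimate pi from n uniform points in the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)  # local generator so the result is reproducible
    inside = 0
    for _ in range(n):
        x = rng.uniform(-1, 1)
        y = rng.uniform(-1, 1)
        if x * x + y * y <= 1.0:  # the point lies inside the unit circle
            inside += 1
    # The fraction inside approximates pi / 4, so scale it by 4.
    return 4.0 * inside / n


print(estimate_pi_2d(5000))
```

The same ratio pi/4 is obtained if the points are drawn from [0, 1] x [0, 1] and compared against a quarter circle, so the factor 4 does not depend on that choice.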

For 3D

Recall that the volume of a sphere is (4/3)*pi*r^3. When r = 1, V = (4/3)*pi. On the other hand, the volume of the enclosing cube (side length 2) is 8.

Therefore the ratio will be ((4/3)*pi) / 8 = pi/6. In other words, we must multiply our fraction by 6 to get pi.
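A minimal Python sketch of the 3D case, using only the standard library. It is an illustration under assumptions rather than the original code: the function name `estimate_pi_3d` and the default seed are hypothetical, and the points are assumed to be uniform in the cube [-1, 1]^3.

```python
import random


def estimate_pi_3d(n, seed=0):
    """Estimate pi from n uniform points in the cube [-1, 1]^3."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x, y, z = (rng.uniform(-1, 1) for _ in range(3))
        if x * x + y * y + z * z <= 1.0:  # the point lies inside the unit sphere
            inside += 1
    # The fraction inside approximates ((4/3) * pi) / 8 = pi / 6, so scale by 6.
    return 6.0 * inside / n


print(estimate_pi_3d(5000))
```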

Note that the estimate becomes less accurate as the dimension increases from 2 to 3.

Still, to be sure, let's simulate with different seeds. That is, let's reduce the variability in the estimates that is due to the stochasticity embedded in the data generation.

The simulation provided better results, yet the 3D estimate is still slightly worse than the 2D estimate.
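Averaging over seeds can be sketched like this. It is a hedged Python illustration with the standard library only; the helper names `estimate_pi` and `average_over_seeds`, the sample size of 2000, and the 20 seeds are assumptions chosen for the example, not values from the original analysis.

```python
import random
import statistics


def estimate_pi(n, dim, rng):
    """One Monte Carlo estimate of pi in 2 or 3 dimensions."""
    scale = {2: 4.0, 3: 6.0}[dim]  # pi/4 in 2D, pi/6 in 3D
    inside = 0
    for _ in range(n):
        # Squared distance from the origin of a uniform point in [-1, 1]^dim.
        if sum(rng.uniform(-1, 1) ** 2 for _ in range(dim)) <= 1.0:
            inside += 1
    return scale * inside / n


def average_over_seeds(n, dim, seeds):
    """Mean and standard deviation of the estimate across several seeds."""
    estimates = [estimate_pi(n, dim, random.Random(s)) for s in seeds]
    return statistics.mean(estimates), statistics.stdev(estimates)


for dim in (2, 3):
    mean, sd = average_over_seeds(2000, dim, range(20))
    print(f"{dim}D: mean={mean:.4f}, sd across seeds={sd:.4f}")
```

The standard deviation across seeds gives a rough sense of how much of the gap between the 2D and 3D estimates is just sampling noise.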

Simulations for pi

3D requires more data points to estimate pi at least as well as 2D. However, the randomness in the data generation process seems to affect the accuracy of the estimation for both. For instance, when the size of the data set is 50000, the 3D estimate of pi appears to be better than the 2D estimate.

When new data are generated with another seed, the effect of the particular data set on the accuracy is more obvious. While the 2D estimate of pi performs better with the data set obtained with seed 10 (N=5000), the 3D estimate outperforms the 2D estimate when the seed is set to 0.
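A comparison across sample sizes and seeds can be sketched as below. This is an illustrative Python version using the standard library, not the original code. The seeds 0 and 10 and the sizes 5000 and 50000 are taken from the text; the size 500 and the helper names are assumptions. Python's generator differs from the one used in the original analysis, so the individual numbers will not match the results reported here.

```python
import math
import random


def estimate_pi(n, dim, seed):
    """Monte Carlo estimate of pi from n uniform points in [-1, 1]^dim."""
    rng = random.Random(seed)
    scale = {2: 4.0, 3: 6.0}[dim]
    inside = sum(
        1
        for _ in range(n)
        if sum(rng.uniform(-1, 1) ** 2 for _ in range(dim)) <= 1.0
    )
    return scale * inside / n


def error_table(sizes, seeds):
    """Absolute error of the 2D and 3D estimates for each (seed, size) pair."""
    rows = []
    for seed in seeds:
        for n in sizes:
            err2 = abs(estimate_pi(n, 2, seed) - math.pi)
            err3 = abs(estimate_pi(n, 3, seed) - math.pi)
            rows.append((seed, n, err2, err3))
    return rows


for seed, n, err2, err3 in error_table([500, 5000, 50000], [0, 10]):
    print(f"seed={seed:2d} N={n:6d} |2D error|={err2:.4f} |3D error|={err3:.4f}")
```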

Nearest Neighbor Simulation

Data Manipulation on Images

The imager library works with cimg objects. A cimg has 4 dimensions: 2 for the pixels, 1 for the channels, and one more for an additional dimension such as time. Since I feel more comfortable with the cimg type, I used both the jpeg and imager libraries.

Image data structure

The pelin photo is represented as a three-dimensional array by the "jpeg" library. The first two dimensions correspond to pixels, while the last one indexes the RGB channels. The main difference between the jpeg and imager representations is that imager adds one more dimension, which can hold additional structure such as time frames.
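The two layouts can be mimicked with nested lists. The sketch below is plain Python with a tiny synthetic image, not the R objects themselves. The helper names `shape` and `to_4d` are hypothetical, and the axis orders (rows x columns x channels for the 3D array, x, y, frame, channel for the 4D one) reflect my understanding of the two packages and should be treated as assumptions.

```python
def shape(arr):
    """Return the dimensions of a regular nested list."""
    dims = []
    while isinstance(arr, list):
        dims.append(len(arr))
        arr = arr[0]
    return tuple(dims)


height, width = 2, 3
# Synthetic 3D image: rows x columns x RGB, values in [0, 1].
img3 = [[[0.1 * r, 0.1 * c, 0.5] for c in range(width)] for r in range(height)]


def to_4d(img):
    """Rearrange rows x columns x channels into x, y, frame, channel."""
    h, w = len(img), len(img[0])
    # A still image has a single frame, so the third axis has length 1.
    return [[[[img[y][x][ch] for ch in range(3)]] for y in range(h)] for x in range(w)]


img4 = to_4d(img3)
print(shape(img3))  # (2, 3, 3)
print(shape(img4))  # (3, 2, 1, 3)
```

The pixel values are identical in both layouts; only the indexing changes, with the extra axis of length 1 standing in for the single frame.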

Displaying channels of an image

Averages on channels

The column averages of each channel seem to follow the same pattern. This suggests that the information carried by the pixels does not depend much on the channel. In other words, the channels of an image carry very similar information about the image, on average.
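Column averages per channel can be computed as in the following sketch. It is plain Python on a small synthetic image, not the photo used in the analysis. The helper `column_means` and the synthetic gradient are assumptions for illustration; the image is deliberately built so that all channels share the same column pattern up to an offset.

```python
def column_means(img, channel):
    """Average one channel over the rows, giving one value per column."""
    h, w = len(img), len(img[0])
    return [sum(img[r][c][channel] for r in range(h)) / h for c in range(w)]


height, width = 3, 4
offsets = (0.0, 0.1, 0.2)  # hypothetical per-channel offsets for R, G, B
# Brightness increases from left to right in every channel.
img = [[[c / 4 + off for off in offsets] for c in range(width)] for r in range(height)]

for ch, name in enumerate("RGB"):
    print(name, [round(m, 3) for m in column_means(img, ch)])
```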

Image Manipulations

As the value of a pixel decreases, the brightness seems to decrease as well (above). This also holds for each channel (below).
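The relation between pixel values and brightness can be illustrated by scaling the values. The sketch is plain Python on a synthetic image with values assumed to lie in [0, 1]; the helpers `scale_brightness` and `mean_value` are hypothetical names, and this is not necessarily the manipulation applied in the original analysis.

```python
def scale_brightness(img, factor):
    """Multiply every pixel value by factor and clamp the result to [0, 1]."""
    return [
        [[min(1.0, max(0.0, v * factor)) for v in px] for px in row]
        for row in img
    ]


def mean_value(img):
    """Average pixel value over all pixels and channels."""
    values = [v for row in img for px in row for v in px]
    return sum(values) / len(values)


img = [[[0.5, 1.0, 0.2], [0.8, 0.4, 0.6]]]  # 1 x 2 pixels, RGB
darker = scale_brightness(img, 0.5)
print(mean_value(img), mean_value(darker))
```

A factor below 1 lowers every value and darkens the image; a factor above 1 brightens it until the values saturate at 1.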

Noisy Image

Noise makes individual pixels more noticeable in the image.
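One way to produce a noisy image is to add random perturbations to each pixel value. The sketch below assumes additive Gaussian noise and pixel values in [0, 1]; the noise model used in the original analysis is not stated, so both the model and the helper name `add_noise` are assumptions.

```python
import random


def add_noise(img, sigma, seed=0):
    """Add zero-mean Gaussian noise to every value and clamp to [0, 1]."""
    rng = random.Random(seed)
    return [
        [[min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in px] for px in row]
        for row in img
    ]


# Flat grey synthetic image: 2 x 2 pixels, RGB.
img = [[[0.5, 0.5, 0.5] for _ in range(2)] for _ in range(2)]
noisy = add_noise(img, sigma=0.1)
print(noisy[0][0])
```

Because each pixel is perturbed independently, neighbouring pixels that were identical now differ, which is why single pixels stand out more.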